We have been having a horrendous time troubleshooting and fixing problems with our Domino infrastructure over the past month. We're not sure what has happened, but our best guess is that some sort of gnome/elf/troll has infested our system and is causing random problems ;-)
It started with a client we launched on our servers May 1. We imported 2 MS SQL databases into an existing notes database (one of our clients merged with two other areas and all 3 signed a contract with us). Everything looked fine, then we launched the users on that database and they reported nothing but trouble. Database opens were taking hours (anywhere between 10 minutes and 4 hours), saves were taking 1-2 minutes, document opens were taking 2-5 minutes. We had launched similar database designs to other clients with databases over 5 gigs in size and didn't have this much trouble - this database was approx 1.4 gigs before any views were built.
We analyzed all the new code we wrote for the client but couldn't find anything suspicious or wrong.
On top of all this, on the same day, May 1, we launched a high-profile website for our client that received lots of positive press - we were getting millions of hits a day.
The CPU usage jumped to 100%, and we assumed this was because of the large number of web users. We quickly built a new server and launched it at our co-located data center, leaving the original one as the webserver. The new server became our primary Notes server for our clients.
Once we did this, CPU usage on the web server (the original one) settled at approx 40% on average - clearly the websites weren't the problem. The new server we built with the client databases had 100% CPU usage and was unusable.
Phoning Lotus Support was not an option - our company's budget is extremely tight, and having just laid off 4 employees, spending money on additional yearly support fees did not make sense. The company was living on a pay-cheque to pay-cheque basis.
I checked the administrator, and unfortunately there were no % CPU usage stats for the Linux box (Windows-only feature?). Having this working would have made it a lot easier to determine what the CPU hog was. This is an unfortunate problem with the Notes server (version? platform?) we were using :-( I hope Lotus fixes this soon.
I noticed that the indexer was "stuck" on the new database we launched for the client. The indexer never, ever finished - it was as though it was stuck in a tight infinite loop. Once I stopped the indexer task, CPU usage immediately dropped to normal levels, approx 35-50%. I came into work the next day and we had the same problem; this time, however, it was the compactor that was stuck on the database with the exact same symptoms, looping forever and driving CPU usage to 100%. We temporarily disabled the indexer & compactor so at least the server would be usable.
A few days later, random things started to happen. We would get unusual server error messages on the console/log that would say "not enough sockets, reducing maximum listeners by 1" (or something to that effect). This would continue until the server was unusable. We would also have random server crashes, freezes, and other problems. This would only affect Domino, not the entire server itself, and would happen about every 30-40 minutes.
Every 2-3 days, the entire server would crash and require a hard reboot (OS crash).
I emergency "quarantined" the new client's database out of the /local/notesdata structure, and the 30-40 minute crashes and most performance issues vanished (however, the Linux version of Domino definitely seems to be slower than the Windows one. Any ideas why??). The server still crashed every 2-3 days, which we eventually tracked down to a bad stick of RAM that was installed in the server we built quickly as the replacement.
We took our backup/hotswap Domino server running on an old custom-built 300 MHz Pentium 2, tossed some more RAM into it, and placed the "quarantined client" onto it, with nothing else except basic configuration databases. Unfortunately for us, the client has a very cautious IT department, so they couldn't open the ports/IPs to access the new server - we had to get them to set up passthru documents, which only compounded the already abysmal performance on their database.
Fast forward one month... every day spent trying to fix and isolate the problem without any luck.
While looking through the lotus support site, I noticed a document about "Orphaned Agent Data Notes". This document mentioned that there was a known issue with R5 (and R6 I assume) where hidden agent data note design elements would become orphaned, and cause "slow database opens". I ran the compagnt tool on the database and voila - no more server crashes and the database was extremely fast, in fact, the 4 gig database became one of our fastest performing databases, even faster than the 300 meg ones!
The lotus technote, as usual, downplayed the issue to try to save face. I find this problem in all of the lotus technotes - instead of trying to help the actual customer/user, they try to downplay the issue and pretend that Lotus didn't have a severe bug. Knowledge bases are about helping users, not marketing!!!!!
The knowledge base article about the Orphaned Agent Data Notes didn't mention anything about server crashes or the random issues we were having. We are 100% sure that the agent data notes were causing the problem - the second we cleaned them out of the database, all of the server problems went away.
Why was the built-in compactor not fixed to check for and destroy the orphaned agent data notes? This would be an extremely beneficial feature for us. Right now we need to resort to running "compagnt" manually on over 300 databases - a very tedious task we now need to perform weekly because the core Notes bug has not been fixed - orphaned agent data notes are still created in Notes 5 and Notes 6!
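For anyone facing the same weekly chore, here is a minimal sketch of how the per-database run could be scripted. It only walks a data directory for .nsf files and builds one command line per database; it does not run anything. The command template is a placeholder (the real compagnt arguments are not given in this post, so check the technote before substituting anything), and the function names are made up for illustration.

```python
import os
import shlex


def find_databases(data_dir):
    """Return the sorted paths of every .nsf file under data_dir."""
    found = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name.lower().endswith(".nsf"):
                found.append(os.path.join(root, name))
    return sorted(found)


def build_commands(data_dir, command_template):
    """Fill the template once per database. Nothing is executed here."""
    return [
        command_template.format(db=shlex.quote(path))
        for path in find_databases(data_dir)
    ]


if __name__ == "__main__":
    # ASSUMPTION: placeholder template, not the real compagnt syntax.
    template = "YOUR_CLEANUP_COMMAND {db}"
    for cmd in build_commands("/local/notesdata", template):
        print(cmd)
```

Printing the commands first (a dry run) makes it easy to eyeball the list of 300+ databases before handing it to a scheduler such as cron.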
After allll that, it was still not over - now we have databases going corrupt, sets of 10-15 of them on our live server. Compact doesn't fix the problem - the databases are still corrupt after finishing the compact -c -i on the corrupt database. I even tried downgrading the entire server to Notes 5.0.10, and re-replicating in every database - the databases still went corrupt when the users started to use them.
It has definitely been the worst 2.3 months in the company's history, and there is still no end in sight.
I also wrote a really cool failover & load balancing system which doesn't require more expensive notes licenses (clustering). I'll be posting this to OpenNTF.org soon. It even does heartbeat monitoring of notes 6 servers and text-messages the administrator's cell phones when servers go down!
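To give a rough idea of what heartbeat monitoring involves, here is a minimal sketch. It is not the system described above, just one common way to do it: try a TCP connect to the server's Notes port (1352) and alert once after several consecutive failures. The `Heartbeat` class, the `notify` callback, and the three-failure threshold are all assumptions made for illustration; sending the actual text message is left to whatever `notify` does.

```python
import socket

NOTES_PORT = 1352  # registered port for Lotus Notes/Domino (NRPC)


def is_up(host, port=NOTES_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class Heartbeat:
    """Count consecutive failed checks and alert once per outage."""

    def __init__(self, check, notify, max_failures=3):
        self.check = check            # callable returning True when healthy
        self.notify = notify          # hypothetical alert hook (e.g. SMS gateway)
        self.max_failures = max_failures
        self.failures = 0
        self.alerted = False

    def beat(self):
        """Run one check; return True if the server answered."""
        if self.check():
            self.failures = 0
            self.alerted = False      # re-arm the alert after recovery
            return True
        self.failures += 1
        if self.failures >= self.max_failures and not self.alerted:
            self.notify("server down after %d failed checks" % self.failures)
            self.alerted = True
        return False
```

In use, something like `Heartbeat(lambda: is_up("notes1.example.com"), send_sms)` would be called from a loop or a cron job every minute or so (the hostname and `send_sms` are placeholders). Requiring several failures in a row avoids paging the administrator over a single dropped connection. Note that a successful TCP connect only shows the port is listening, not that Domino is actually serving requests.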
Any thoughts? I've outlined a "grid" of problems we had at the bottom as a quick reference ;-)
-Eric B, frazzled IT person.
| Problem | Cause | Notes Bug |
| --- | --- | --- |
| **Extremely** slow database performance, doc saves, doc opens, db open (took hours to open the db!) | Orphaned agent data documents. This issue seems to be compounded when the database is large, > 1GB (or perhaps has large numbers of documents) | Orphaned agent data documents are still created by Designer & other unknown internal issues. No(?) attempt has been made by Lotus to stop orphaned agent data documents from impacting performance (they only tried, unsuccessfully, to fix the cause - the documents themselves) |
| Random Domino crashes, socket errors, etc., approx every 1-4 hours | Orphaned agent data documents. 100% sure this wasn't caused by the bad RAM in the server - it occurred before we installed the new RAM, and also after we put in the tested, good RAM. | Seems to be a Notes bug. I don't believe Lotus knows this occurs - I believe the scenarios Lotus experienced with orphaned agent data documents were milder than the one we had happen. |
| Random OS crashes (Linux kernel panics) during high usage, approx every 2-4 days | Bad memory chip. Our computer part supplier was selling extremely bad RAM - approx 45% of the RAM we ordered from them reported as bad when we tested it after the fact. I wish we'd had time to test the RAM before we shipped it out, but we needed to take action fast to try to solve the problems we were having | N/A |
| Corruption occurring on 15-20 databases every morning, on our "Live" product databases the clients use, and our "Design Forum" databases which use a modified discussion template. Corruption reports one of the following messages: | Unknown. Any ideas? This is really causing us extreme amounts of downtime...... | A database should never go corrupt. This is a Lotus bug, but it's probably caused by something we did. A database shouldn't go corrupt even if we do something wacky ;-) This occurs in both 5.0.10 and 6.0.2CF1. |